Abstract

We consider kernel based learning methods for regression and analyze what happens
to the risk minimizer when new variables, statistically independent of input and target
variables, are added to the set of input variables; this problem arises, for example, in the
detection of causality relations between two time series. We find that the risk minimizer
remains unchanged if we constrain the risk minimization to hypothesis spaces induced
by suitable kernel functions. We show that not all kernel induced hypothesis spaces
enjoy this property. We present sufficient conditions ensuring that the risk minimizer
does not change, and show that they hold for inhomogeneous polynomial and Gaussian
RBF kernels. We also provide examples of kernel induced hypothesis spaces whose risk
minimizer changes if independent variables are added as input.

1 Introduction 

Recent advances in kernel-based learning algorithms have brought the field of machine learning 
closer to the goal of autonomy, i.e. the goal of providing learning systems that require as little 
intervention as possible on the part of a human user (Vapnik, 1998). Kernel algorithms work by 
embedding data into a Hilbert space, and searching for linear relations in that space. The embedding
is performed implicitly, by specifying the inner product between pairs of points. Kernel-based
approaches are generally formulated as convex optimization problems, with a single minimum, and 
thus do not require heuristic choices of learning rates, start configuration or other free parameters. 
On the other hand, the choice of the kernel and the corresponding feature space are central choices 
that generally must be made by a human user. While this provides opportunities to use prior 
knowledge about the problem at hand, in practice it is difficult to find prior justification for the 
use of one kernel instead of another (Shawe-Taylor and Cristianini, 2004). The purpose of this
work is to introduce a novel property enjoyed by some kernel-based learning machines, which is of 
particular relevance when a machine learning approach is developed to evaluate causality between 
two simultaneously acquired signals: in this paper we define a learning machine to be invariant 
w.r.t. independent variables (property IIV) if it does not change when statistically independent 
variables are added to the set of input variables. We show that the risk minimizer constrained to 
belong to suitable kernel induced hypothesis spaces is IIV. This property holds true for hypothesis 
spaces induced by inhomogeneous polynomial and Gaussian kernel functions. We discuss the case 
of quadratic loss function and provide sufficient conditions for a kernel machine to be IIV. We also 
present examples of kernels which induce spaces where the risk minimizer is not IIV; such kernels
should not be used to measure causality.

2 Preliminaries 

We focus on the problem of predicting the value of a random variable (r.v.) $s \in \mathbb{R}$ with a
function $f(x)$ of the r.v. vector $x \in \mathbb{R}^d$. Given a loss function $V$ and a set of functions called the
hypothesis space $\mathcal{H}$, the best predictor is sought for in $\mathcal{H}$ as the minimizer $f^*$ of the prediction
error or generalization error or risk defined as:

$$R[f] = \int V(s, f(x))\, p(x, s)\, dx\, ds, \tag{1}$$

where $p(x, s)$ is the joint density function of $x$ and $s$. Given another r.v. $y \in \mathbb{R}^q$, let us add $y$ to
the input variables and define a new vector appending $x$ and $y$, i.e. $z = (x^T, y^T)^T$. Let us also
consider the predictor $f'^*(z)$ of $s$, based on the knowledge of the r.v. $x$ and $y$, minimizing the risk:

$$R'[f'] = \int V(s, f'(z))\, p(z, s)\, dz\, ds. \tag{2}$$

If $y$ is statistically independent of $x$ and $s$, it is intuitive to require that $f^*(x)$ and $f'^*(z)$ coincide
and have the same risk. Indeed in this case y variables do not convey any information on the 
problem at hand. The property stated above is important when predictors are used to identify 
causal relations among simultaneously acquired signals, an important problem with applications 
in many fields ranging from economy to physiology (see, e.g., Ancona et al., 2004, and references 
therein). The major approach to this problem examines if the prediction of one series could be 
improved by incorporating information of the other, as proposed by Granger, 1969. In particular, 
if the prediction error of the first time series is reduced by including measurements from the second 
time series in the regression model, then the second time series is said to have a causal influence 
on the first time series. However, not all prediction schemes are suitable to evaluate causality
between two time series; they should be invariant w.r.t. independent variables, so that, at least 
asymptotically, they would be able to recognize variables without causality relationship. 
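
The comparison of prediction errors described above can be sketched numerically. The following is a minimal illustration, not the procedure of any cited work: it assumes a linear predictor with a single lag, and the toy series, the coupling coefficients and the helper names `residual_variance` and `granger_ratio` are all hypothetical.

```python
import numpy as np

def residual_variance(target, regressors):
    """Mean squared residual of a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target))] + list(regressors))
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.mean((target - design @ coeffs) ** 2))

def granger_ratio(effect, cause):
    """Error using both pasts divided by error using the effect's own past (lag 1)."""
    own = residual_variance(effect[1:], [effect[:-1]])
    both = residual_variance(effect[1:], [effect[:-1], cause[:-1]])
    return both / own

# Hypothetical toy data: y drives x with a one-step delay, x does not drive y.
rng = np.random.default_rng(0)
n = 2000
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal()
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + 0.1 * rng.normal()

ratio_y_to_x = granger_ratio(x, y)  # well below 1: the past of y helps predict x
ratio_x_to_y = granger_ratio(y, x)  # close to 1: the past of x does not help predict y
```

A ratio close to one in the second direction is what an IIV predictor is expected to return, at least asymptotically, when the added series carries no information.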

In this work we consider as predictor the function minimizing the risk and we show that it 
does not always enjoy this property. In particular we show that if we constrain the minimization 
of the risk to suitable hypothesis spaces then the risk minimizer is IIV (stable under inclusion of 
independent variables). We limit our analysis to the case of quadratic loss function $V(s, f(x)) = (s - f(x))^2$.

2.1 Unconstrained $\mathcal{H}$

If we do not constrain the hypothesis space, then $\mathcal{H}$ is the space of measurable functions for which
$R$ is well defined. It is well known (Papoulis, 1985) that the minimizer of (1) is the regression
function:

$$f^*(x) = \int s\, p(s|x)\, ds.$$

Note that if $y$ is independent of $x$ and $s$ then $p(s|x) = p(s|x, y)$, and this implies:

$$f'^*(z) = \int s\, p(s|x, y)\, ds = \int s\, p(s|x)\, ds = f^*(x).$$

Hence the regression function does not change if $y$ is also used for predicting $s$; the regression
function is stable under inclusion of independent variables.

2.2 Linear hypothesis spaces

Let us consider the case of linear hypothesis spaces:

$$\mathcal{H} = \{ f \mid f(x) = w_x^T x,\ w_x \in \mathbb{R}^d \}.$$



Here, and in all the hypothesis spaces that we consider in this work, we assume that the mean
value of the predictor and the mean of $s$ coincide:

$$E\{s - w_x^T x\} = 0, \tag{3}$$

where $E\{\cdot\}$ means the expectation value. This can be easily achieved by adding a constant
component (equal to one) to the $x$ vector. Equation (3) is a sufficient condition for property IIV
in the case of linear kernels. Indeed, let us consider the risk associated to an element of $\mathcal{H}$:

$$R[w_x] = \int (s - w_x^T x)^2\, p(x, s)\, dx\, ds. \tag{4}$$

The parameter vector $w_x^*$, minimizing the risk, is solution of the following linear system:

$$E\{x x^T\}\, w_x = E\{s x\}. \tag{5}$$

Let us consider the hypothesis space of linear functions in the $z = (x^T, y^T)^T$ variable:

$$\mathcal{H}' = \{ f' \mid f'(z) = w_z^T z,\ w_z \in \mathbb{R}^{d+q} \}.$$

Writing $w_z = (w_x^T, w_y^T)^T$ with $w_y \in \mathbb{R}^q$, let us consider the risk associated to an element of $\mathcal{H}'$:

$$R'[w_z] = \int (s - w_x^T x - w_y^T y)^2\, p(x, y, s)\, dx\, dy\, ds. \tag{6}$$

If $y$ is independent of $x$ and $s$, the cross term $-2\, E\{s - w_x^T x\}\, w_y^T E\{y\}$ of (6) vanishes due to (3),
and (6) can be written as:

$$R'[w_z] = R[w_x] + \int (w_y^T y)^2\, p(y)\, dy. \tag{7}$$

It follows that the minimum of $R'$ corresponds to $w_y = 0$. In conclusion, if $y$ is independent
of $x$ and $s$, the predictors $f^*(x) = w_x^{*T} x$ and $f'^*(z) = w_z^{*T} z$, which minimize the risks (4) and
(6) respectively, coincide (i.e., $f^*(x) = f'^*(x, y)$ for every $x$ and every $y$). Moreover the weights
associated to the components of the $y$ vector are identically null. So the risk minimizer in linear
hypothesis spaces is an IIV predictor.
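
The linear case can be checked numerically on a finite sample. The sketch below is only an illustration under stated assumptions: independence is enforced exactly by pairing every sample $(x_i, s_i)$ with every value $y_j$ (a product empirical measure), and the toy data and the helper names `product_sample` and `fit_linear` are hypothetical.

```python
import numpy as np

def product_sample(x, s, y):
    """Pair every (x_i, s_i) with every y_j.

    Under the resulting empirical measure y is exactly independent of (x, s).
    """
    n, m = len(s), len(y)
    return np.repeat(x, m, axis=0), np.repeat(s, m), np.tile(y, (n, 1))

def fit_linear(inputs, s):
    """Least-squares weights with a constant component, as assumed in (3)."""
    design = np.hstack([np.ones((len(s), 1)), inputs])
    w, *_ = np.linalg.lstsq(design, s, rcond=None)
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 2))
s = np.sin(x[:, 0]) + 0.5 * x[:, 1] + 0.1 * rng.normal(size=40)
y = rng.normal(size=(7, 1)) + 3.0          # deliberately not zero-mean

x_rep, s_rep, y_rep = product_sample(x, s, y)
w_x = fit_linear(x_rep, s_rep)                        # inputs: x only
w_z = fit_linear(np.hstack([x_rep, y_rep]), s_rep)    # inputs: x and y
w_y = w_z[-1]                                         # weight on y
```

Up to floating-point error, `w_y` vanishes and the remaining entries of `w_z` equal `w_x`. The constant component matters here: without it, condition (3) is not enforced and a non-zero mean of $y$ would generally give $y$ a non-zero weight.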



3 Nonlinear hypothesis spaces 

Let us now consider nonlinear hypothesis spaces. An important class of nonlinear models is
obtained mapping the input space to a higher dimensional feature space and finding a linear
predictor in this new space. Let $\phi$ be a nonlinear mapping function which associates to $x \in \mathbb{R}^d$ the
vector $\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_h(x))^T \in \mathbb{R}^h$, where $\phi_1, \phi_2, \ldots, \phi_h$ are $h$ fixed real valued functions.
Let us consider linear predictors in the space spanned by the functions $\phi_i$ for $i = 1, 2, \ldots, h$. The
hypothesis space is then:

$$\mathcal{H} = \{ f \mid f(x) = w_x^T \phi(x),\ w_x \in \mathbb{R}^h \}.$$

In this space, the best linear predictor of $s$ is the function $f^* \in \mathcal{H}$ minimizing the risk:

$$R[w_x] = \int (s - w_x^T \phi(x))^2\, p(x, s)\, dx\, ds. \tag{8}$$

Let us denote $w^*$ the minimizer of (8). We first restrict to the case of a single additional new
feature: let $y$ be a new real random variable, statistically independent of $s$ and $x$, and denote
$\gamma'(z)$, with $z = (x^T, y)^T$, a generic new feature involving the $y$ variable. For predicting the r.v. $s$
we use the linear model involving the new feature:

$$f'(z) = w_z^T \phi'(z),$$

where $\phi'(z) = (\phi(x)^T, \gamma'(z))^T$ and $w_z = (w_x^T, v)^T$ has to be fixed minimizing:

$$R'[w_z] = \int (s - w_x^T \phi(x) - v\, \gamma'(x, y))^2\, p(x, s)\, p(y)\, dx\, dy\, ds. \tag{9}$$

We would like to have $v = 0$ at the minimum of $R'$. To this end let us evaluate:

$$\left.\frac{\partial R'}{\partial v}\right|_0 = -2 \int \gamma'(x, y)\, (s - w^{*T} \phi(x))\, p(x, s)\, p(y)\, dx\, dy\, ds,$$

where $\partial/\partial v|_0$ means that the derivative is evaluated at $v = 0$ and $w_x = w^*$, where $w^*$ minimizes
the risk (8). If $\partial R'/\partial v|_0$ is not zero, then the predictor is changed after inclusion of feature $\gamma'$.

Therefore $\partial R'/\partial v|_0 = 0$ is the condition that must be satisfied by all the features, involving $y$, to
constitute an IIV (stable) predictor. It is easy to show that if $\gamma'$ does not depend on $x$, then this
condition holds. More important, it holds if $\gamma'$ is the product of a function $\gamma(y)$ of $y$ alone and of
a component $\phi_i$ of the feature vector $\phi(x)$:

$$\gamma'(x, y) = \gamma(y)\, \phi_i(x) \quad \text{for some } i \in \{1, \ldots, h\}. \tag{10}$$



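Indeed in this case we have:

$$\left.\frac{\partial R'}{\partial v}\right|_0 = -2 \left[ \int \gamma(y)\, p(y)\, dy \right] \left[ \int \phi_i(x)\, (s - w^{*T} \phi(x))\, p(x, s)\, dx\, ds \right] = 0,$$

because the second integral vanishes as $w^*$ minimizes the risk (8) when only $x$ variables are used
to predict $s$. We observe that the second derivative

$$\frac{\partial^2 R'}{\partial v^2} = 2 \int (\gamma'(x, y))^2\, p(x, s)\, p(y)\, dx\, dy\, ds$$

is positive; since $R'$ is a convex quadratic function of $(w_x, v)$ whose gradient vanishes at $(w^*, 0)$,
the point $(w^*, 0)$ remains a minimum after inclusion of the $y$ variable. In conclusion, if the
new feature $\gamma'$ involving $y$ verifies (10) then the predictor $f'^*(z)$, which uses both $x$ and $y$ for
predicting $s$, minimizing (9) and the predictor $f^*(x)$ minimizing (8) coincide. This shows that the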
risk minimizer is unchanged after inclusion of $y$ in the input variables. This preliminary result,
which is used in the next subsection, may be easily seen to hold also for finite-dimensional vectorial $y$.
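
Condition (10) can be probed on a finite sample. The following sketch is an illustration under assumptions, not part of the original analysis: it takes a scalar $x$ with $\phi(x) = (1, x, x^2)$, enforces independence through a product sample, and compares a new feature of the form (10) with one that is not. The target function, the grids and the helper `fit` are hypothetical choices.

```python
import numpy as np

def fit(design, s):
    """Least-squares weights, i.e. the empirical risk minimizer for quadratic loss."""
    w, *_ = np.linalg.lstsq(design, s, rcond=None)
    return w

# Product sample: every (x_i, s_i) is paired with every y_j, so y is exactly
# independent of (x, s) under the empirical measure (toy data, assumed).
x_grid = np.linspace(-2.0, 2.0, 41)
s_grid = x_grid ** 3 + 0.3 * np.sin(5.0 * x_grid)
y_grid = np.linspace(-1.0, 1.0, 9)
x = np.repeat(x_grid, y_grid.size)
s = np.repeat(s_grid, y_grid.size)
y = np.tile(y_grid, x_grid.size)

phi = np.column_stack([np.ones_like(x), x, x ** 2])   # phi(x) = (1, x, x^2)
w_star = fit(phi, s)

# gamma'(x, y) = cos(y) * phi_3(x): satisfies condition (10).
w_ok = fit(np.column_stack([phi, np.cos(y) * x ** 2]), s)
# gamma'(x, y) = cos(y) * x^3: x^3 is not a component of phi, so (10) fails.
w_bad = fit(np.column_stack([phi, np.cos(y) * x ** 3]), s)

v_ok, v_bad = w_ok[-1], w_bad[-1]
```

In the first case `v_ok` is zero up to rounding and the weights on $\phi(x)$ are unchanged; in the second case `v_bad` is clearly non-zero, so the predictor changes even though $y$ carries no information about $s$.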



3.1 Kernel induced hypothesis spaces 

In this section we analyze whether our invariance property holds true in specific hypothesis spaces which
are relevant for many learning schemes such as Support Vector Machines (Vapnik, 1998) and
regularization networks (Evgeniou et al., 2000), to cite just a few. To this end, in order to
predict $s$, we map $x$ into a higher dimensional feature space $\mathcal{H}$ by using the mapping:

$$\phi(x) = (\sqrt{\alpha_1}\, \psi_1(x), \sqrt{\alpha_2}\, \psi_2(x), \ldots, \sqrt{\alpha_h}\, \psi_h(x), \ldots),$$

where $\alpha_i$ and $\psi_i$ are the eigenvalues and eigenfunctions of an integral operator whose kernel
$K(x, x')$ is a positive definite symmetric function with the property $K(x, x') = \phi(x)^T \phi(x')$ (see
Mercer's theorem, Vapnik, 1998).

Let us now consider in detail two important kernels. 

3.2 Case $K(x, x') = (1 + x^T x')^p$

Let us consider the hypothesis space induced by this kernel:

$$\mathcal{H} = \{ f \mid f(x) = w_x^T \phi(x),\ w_x \in \mathbb{R}^{d'} \},$$

where the components $\phi_i(x)$ of $\phi(x)$ are $d'$ monomials, up to $p$-th degree, which enjoy the
following property:

$$\phi(x)^T \phi(x') = (1 + x^T x')^p.$$

Let $f^*(x)$ be the minimizer of the risk in $\mathcal{H}$. Moreover, let $z = (x^T, y^T)^T$ and consider the
hypothesis space $\mathcal{H}'$ induced by the mapping $\phi'(z)$ such that:

$$\phi'(z)^T \phi'(z') = (1 + z^T z')^p.$$

Let $f'^*(z)$ be the minimizer of the risk in $\mathcal{H}'$. If $y$ is independent of $x$ and $s$ then $f^*(x)$ and $f'^*(z)$
coincide. In fact the components of $\phi'(z)$ are all the monomials, in the variables $x$ and $y$, up to
the $p$-th degree: it follows trivially that $\phi'(z)$ can be written as:

$$\phi'(z) = (\phi(x)^T, \gamma'(z)^T)^T,$$

where each component $\gamma'_i(z)$ of the vector $\gamma'(z)$ verifies (10), that is, it is given by the product of
a component $\phi_j(x)$ of the vector $\phi(x)$ and of a function $\gamma_i(y)$ of the variable $y$ only:

$$\gamma'_i(z) = \phi_j(x)\, \gamma_i(y).$$

As an example, we show this property for the case of $x = (x_1, x_2)^T$, $z = (x_1, x_2, y)^T$ and $p = 2$.
In this case the mapping functions $\phi(x)$ and $\phi'(z)$ are:

$$\phi(x) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2)^T,$$

$$\phi'(z) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2, \sqrt{2} y, \sqrt{2} x_1 y, \sqrt{2} x_2 y, y^2)^T,$$

where one can easily check that $\phi(x)^T \phi(x') = (1 + x^T x')^2$ and $\phi'(z)^T \phi'(z') = (1 + z^T z')^2$. In this
case the vector $\gamma'(z)$ is:

$$\gamma'(z) = (\sqrt{2} y, \sqrt{2} x_1 y, \sqrt{2} x_2 y, y^2)^T.$$

According to the argument described before, the risk minimizer in this hypothesis space satisfies
the invariance property.

Note that, remarkably, the risk minimizer in the hypothesis space induced by the homogeneous
polynomial kernel $K(x, x') = (x^T x')^p$ does not have the invariance property for a generic
probability density, as one can easily check working out explicitly the $p = 2$ case.
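
The contrast between the inhomogeneous and the homogeneous kernel for $p = 2$ can be illustrated numerically. The sketch below is hedged: it works with explicit monomial features rather than with the kernels themselves (the $\sqrt{2}$ factors are dropped because they do not change the spanned space), it enforces independence through a product sample, and the toy target and the helper names `monomials` and `y_weights` are hypothetical.

```python
import numpy as np

def monomials(x1, x2, y, homogeneous):
    """Degree-2 monomial features, split into (x-only features, features involving y)."""
    only_x = [x1 ** 2, x1 * x2, x2 ** 2]
    with_y = [x1 * y, x2 * y, y ** 2]
    if not homogeneous:            # (1 + z^T z')^2 also spans degrees 0 and 1
        only_x = [np.ones_like(x1), x1, x2] + only_x
        with_y = [y] + with_y
    return np.column_stack(only_x), np.column_stack(with_y)

def y_weights(x1, x2, y, s, homogeneous):
    """Weights on the y-dependent features of the empirical risk minimizer."""
    fx, fy = monomials(x1, x2, y, homogeneous)
    w, *_ = np.linalg.lstsq(np.hstack([fx, fy]), s, rcond=None)
    return w[fx.shape[1]:]

# Product sample (toy data, assumed): y is exactly independent of (x, s).
rng = np.random.default_rng(1)
n, y_grid = 60, np.linspace(-1.0, 1.0, 9)
x1_s, x2_s = rng.normal(size=n), rng.normal(size=n)
s_s = 1.0 + np.sin(x1_s) + 0.5 * x1_s * x2_s
x1, x2, s = (np.repeat(a, y_grid.size) for a in (x1_s, x2_s, s_s))
y = np.tile(y_grid, n)

v_inhom = y_weights(x1, x2, y, s, homogeneous=False)  # all weights vanish
v_hom = y_weights(x1, x2, y, s, homogeneous=True)     # weight on y^2 does not
```

In the homogeneous case the feature $y^2$ is a function of $y$ times the constant, but the constant is not a component of $\phi(x)$, so condition (10) fails; with this toy target the residual of the $x$-only fit has non-zero mean and the weight on $y^2$ (the last entry of `v_hom`) is far from zero.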

3.3 Translation invariant kernels 

In this section we present a formalism which generalizes our discussion to the case of hypothesis
spaces whose features constitute an uncountable set. We show that the IIV property holds for
linear predictors on feature spaces induced by translation invariant kernels. In fact let
$K(x, x') = K(x - x')$ be a positive definite kernel function, with $x, x' \in \mathbb{R}^d$. Let $\tilde{K}(\omega_x)$ be the Fourier
transform of $K(x)$: $K(x) \leftrightarrow \tilde{K}(\omega_x)$. For the time shifting property we have that:
$K(x - x') \leftrightarrow \tilde{K}(\omega_x)\, e^{-j \omega_x^T x'}$. By definition of the inverse Fourier transform, neglecting constant factors, we
know that (Girosi, 1998):

$$K(x - x') = \int \tilde{K}(\omega_x)\, e^{-j \omega_x^T x'}\, e^{j \omega_x^T x}\, d\omega_x.$$

Being $K$ positive definite we can write:

$$K(x - x') = \int \sqrt{\tilde{K}(\omega_x)}\, e^{j \omega_x^T x} \left( \sqrt{\tilde{K}(\omega_x)}\, e^{j \omega_x^T x'} \right)^* d\omega_x,$$

where $*$ indicates conjugate. Then we can write $K(x, x') = \langle \phi_x, \phi_{x'} \rangle$ where:

$$\phi_x(\omega_x) = \sqrt{\tilde{K}(\omega_x)}\, e^{j \omega_x^T x} \tag{11}$$

are the generalized eigenfunctions. Note that, in this case, the mapping function $\phi_x$ associates
a function to $x$, that is $\phi_x$ maps the input vector $x$ in a feature space with an infinite and
uncountable number of features. Let us consider the hypothesis space induced by $K$:

$$\mathcal{H} = \{ f \mid f(x) = \langle w_x, \phi_x \rangle,\ w_x \in W_x \},$$

where:

$$\langle w_x, \phi_x \rangle = \int_{\mathbb{R}^d} w_x(\omega_x)\, \phi_x^*(\omega_x)\, d\omega_x, \tag{12}$$

and $W_x$ is the set of complex measurable functions for which (12) is well defined and real$^1$. Note
that $w_x$ is now a complex function, it is not a vector anymore. In this space the best linear
predictor is the function $f = \langle w_x, \phi_x \rangle$ in $\mathcal{H}$ minimizing the risk functional:

$$R[w_x] = E\{(s - \langle w_x, \phi_x \rangle)^2\}.$$

$^1$ In particular elements of $W_x$ satisfy $w_x(-\omega_x) = w_x^*(\omega_x)$.
It is easy to show that the optimal function $w_x$ is solution of the following integral equation:

$$E\{s\, e^{-j \omega_x^T x}\} = \int w_x(\xi_x)\, \sqrt{\tilde{K}(\xi_x)}\, \Phi_x^*(\omega_x + \xi_x)\, d\xi_x, \tag{13}$$

where $\xi_x$ is a dummy integration variable and $\Phi_x(\omega_x) = E\{e^{j \omega_x^T x}\}$ is the characteristic function$^2$
of the r.v. $x$ (Papoulis, 1985). Let us indicate $F(\omega_x) = w_x(\omega_x) \sqrt{\tilde{K}(\omega_x)}$ and
$G(\omega_x) = E\{s\, e^{j \omega_x^T x}\}$. Then (13) can be written as:

$$G(\omega_x) = F(\omega_x) \star \Phi_x(\omega_x),$$

where $\star$ indicates cross-correlation between complex functions. In the spatial domain this implies:

$$G(x) = F^*(x)\, p(-x).$$

In conclusion, assuming that the density $p(x)$ is strictly positive, the function $w_x(\omega_x)$ minimizing
the risk is unique and it is given by:

$$w_x(\omega_x) = \mathcal{F}\{G^*(x) / p(-x)\} / \sqrt{\tilde{K}(\omega_x)},$$

where $\mathcal{F}$ denotes the Fourier transform. Substituting this expression into equation (12) leads to

$$f(x) = \int s\, p(s|x)\, ds,$$

i.e. the risk minimizer coincides with the regression function. In other words, the hypothesis
space $\mathcal{H}$, induced by $K$, is sufficiently large to contain the regression function. This proves that
translation invariant kernels are IIV.
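
The representation $K(x, x') = \langle \phi_x, \phi_{x'} \rangle$ with the features of eq. (11) can be verified numerically in a simple case. The sketch below is only an illustration: it assumes a one-dimensional input, a Gaussian kernel of width `sigma`, a Fourier convention in which the constant factor is absorbed into $\tilde{K}$, and a plain Riemann sum on a truncated frequency grid. All names are hypothetical.

```python
import numpy as np

sigma = 1.0                                   # assumed kernel width
omega = np.linspace(-40.0, 40.0, 16001)       # frequency grid (1-D input)
d_omega = omega[1] - omega[0]

def gaussian_kernel(u):
    """Translation invariant Gaussian kernel K(u), with u = x - x'."""
    return np.exp(-u ** 2 / (2.0 * sigma ** 2))

# Fourier transform of the Gaussian kernel under the convention
# K(u) = integral of k_tilde(omega) * exp(j * omega * u) d omega.
k_tilde = sigma / np.sqrt(2.0 * np.pi) * np.exp(-(sigma * omega) ** 2 / 2.0)

def phi(x):
    """Feature function of eq. (11), sampled on the frequency grid."""
    return np.sqrt(k_tilde) * np.exp(1j * omega * x)

def inner(a, b):
    """<a, b> = integral of a * conj(b), approximated by a Riemann sum."""
    return np.sum(a * np.conj(b)) * d_omega

x, x_prime = 0.3, -1.1
approx = inner(phi(x), phi(x_prime))
exact = gaussian_kernel(x - x_prime)
```

The imaginary part of `approx` cancels by symmetry of the grid, and its real part agrees with `exact` to high accuracy because the integrand is smooth and decays rapidly.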

$^2$ $\Phi_x(-\omega_x)$ is the Fourier transform of the probability density $p(x)$ of the r.v. $x$.

It is interesting to work out and explicitly prove the IIV property in the case of translation
invariant and separable kernels. As in the previous section, let $y \in \mathbb{R}^q$ be a r.v. vector independent
of $x$ and $s$ and use the vector $z = (x^T, y^T)^T$ for predicting $s$. To this end, let us consider the
following mapping function:

$$\phi'_z(\omega_z) = \sqrt{\tilde{K}'(\omega_z)}\, e^{j \omega_z^T z}, \tag{14}$$

where $\omega_z = (\omega_x^T, \omega_y^T)^T$ and $\langle \phi'_z, \phi'_{z'} \rangle = K'(z - z')$. Let us consider the hypothesis space induced
by $K'$:

$$\mathcal{H}' = \{ f' \mid f'(z) = \langle w_z, \phi'_z \rangle,\ w_z \in W_z \}.$$

The best linear predictor is the function $f' = \langle w_z, \phi'_z \rangle$ in $\mathcal{H}'$ minimizing the risk functional:

$$R'[w_z] = E\{(s - \langle w_z, \phi'_z \rangle)^2\},$$

where the optimal function $w_z$ is solution of the integral equation (see (13)):

$$E\{s\, e^{-j \omega_z^T z}\} = \int w_z(\xi_z)\, \sqrt{\tilde{K}'(\xi_z)}\, \Phi_z^*(\omega_z + \xi_z)\, d\xi_z, \tag{15}$$

where $\omega_z = (\omega_x^T, \omega_y^T)^T$. Note that, being $x$ and $y$ independent, the characteristic function of $z$
factorizes:

$$\Phi_z(\omega_z) = \Phi_x(\omega_x)\, \Phi_y(\omega_y).$$

If $K'(z)$ is separable:

$$K'(z) = K(x)\, H(y) \tag{16}$$

then its Fourier transform takes the form $\tilde{K}'(\omega_z) = \tilde{K}(\omega_x)\, \tilde{H}(\omega_y)$. Being
$E\{s\, e^{-j \omega_z^T z}\} = E\{s\, e^{-j \omega_x^T x}\}\, E\{e^{-j \omega_y^T y}\}$, (15) becomes:

$$E\{s\, e^{-j \omega_x^T x}\}\, E\{e^{-j \omega_y^T y}\} = \int_{\mathbb{R}^{d+q}} w_z(\xi_z)\, \sqrt{\tilde{K}(\xi_x)}\, \sqrt{\tilde{H}(\xi_y)}\, \Phi_x^*(\omega_x + \xi_x)\, \Phi_y^*(\omega_y + \xi_y)\, d\xi_z. \tag{17}$$
The risk minimizer $w_z$ solution of (17) is:

$$w_z(\omega_z) = w_x(\omega_x)\, \frac{\delta(\omega_y)}{\sqrt{\tilde{H}(\omega_y)}}, \tag{18}$$

where $w_x$ is the solution of (13). This can be checked substituting (18) in (17) and using equation (13). The structure of eq. (18)
guarantees that the predictor is unchanged under inclusion of variables $y$. This is the case, in
particular, for the Gaussian RBF kernel. Finally note that a property similar to (10) holds true
also in this hypothesis space. In fact, as $K'$ is separable, (14) implies that:

$$\phi'_z(\omega_z) = \phi_x(\omega_x)\, \gamma_y(\omega_y), \tag{19}$$

where $\gamma_y(\omega_y) = \sqrt{\tilde{H}(\omega_y)}\, e^{j \omega_y^T y}$ with the property: $\langle \gamma_y, \gamma_{y'} \rangle = H(y - y')$. Eq. (19) may be seen
as a continuum version of property (10).
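
The separability condition (16) is easy to confirm for the Gaussian RBF kernel. The sketch below is a minimal check under assumptions: the dimensions `d` and `q`, the kernel width and the random test points are arbitrary illustrative choices, and `rbf` is a hypothetical helper.

```python
import numpy as np

def rbf(u, sigma=1.0):
    """Gaussian RBF kernel as a function of the difference u = a - a'."""
    return np.exp(-np.sum(u ** 2, axis=-1) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
d, q = 3, 2                                  # assumed input dimensions
z, z_prime = rng.normal(size=(2, 100, d + q))
diff = z - z_prime

k_joint = rbf(diff)                               # K'(z - z')
k_product = rbf(diff[:, :d]) * rbf(diff[:, d:])   # K(x - x') H(y - y')
```

The two arrays agree because the squared norm of $z - z'$ splits into the sum of the squared norms of its $x$ and $y$ parts, which is what makes the factorization (16) hold with $K$ and $H$ both Gaussian.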

4 Discussion 

In this work we consider, in the frame of kernel methods for regression, the following question:
does the risk minimizer change when statistically independent variables are added to the set of
input variables? We show that not all hypothesis spaces guarantee that the risk minimizer remains
unchanged. We outline sufficient conditions ensuring this property, and show that it holds for inhomogeneous
polynomial and Gaussian RBF kernels. While these results are relevant to construct machine
learning approaches to study causality between time series, in our opinion they might also be
important in the more general task of kernel selection. Our discussion concerns the risk minimizer,
hence it holds only in the asymptotic regime; the analysis of the practical implications of our
results, i.e. when only a finite data set is available to train the learning machine, is a matter for
further research. It is worth noting, however, that our results hold also for a finite set of data,
if the probability distribution is replaced by the empirical measure. Another interesting question
is how this scenario changes when a regularization constraint is imposed on the risk minimizer
(Poggio et al., 1990) and loss functions different from the quadratic one are considered. Moreover
it would be interesting to analyze the connections between our results and classical problems of
machine learning such as feature selection and sparse representation, that is the determination
of a solution with only a few nonvanishing components. If we look for the solution
in overcomplete or redundant spaces of vectors or functions, where more than one representation
exists, then it makes sense to impose a sparsity constraint on the solution. In the case here
considered, the sparsity of $w^*$ emerges as a consequence of the existence of independent input
variables using a quadratic loss function.

The authors thank two anonymous reviewers whose comments were valuable to improve the 
presentation of this work. 